NSF PAR Search | NSF Public Access Repository


Contextual Document Embeddings

Morris, John X; Rush, Alexander M (April 2025, ICLR)

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
Free, publicly-accessible full text available April 24, 2026
Integrating the Study of Polyploidy Across Organisms, Tissues, and Disease

https://doi.org/10.1146/annurev-genet-111523-102124

Morris, John P; Baslan, Timour; Soltis, Douglas E; Soltis, Pamela S; Fox, Donald T (November 2024, Annual Review of Genetics)

Polyploidy is a cellular state containing more than two complete chromosome sets. It has largely been studied as a discrete phenomenon in either organismal, tissue, or disease contexts. Increasingly, however, investigation of polyploidy across disciplines is coalescing around common principles. For example, the recent Polyploidy Across the Tree of Life meeting considered the contribution of polyploidy both in organismal evolution over millions of years and in tumorigenesis across much shorter timescales. Here, we build on this newfound integration with a unified discussion of polyploidy in organisms, cells, and disease. We highlight how common polyploidy is at multiple biological scales, thus eliminating the outdated mindset of its specialization. Additionally, we discuss rules that are likely common to all instances of polyploidy. With increasing appreciation that polyploidy is pervasive in nature and displays fascinating commonalities across diverse contexts, inquiry related to this important topic is rapidly becoming unified.
Full Text Available
Language Model Inversion

Morris, John X; Zhao, Wenting; Chiu, Justin T; Shmatikov, Vitaly; Rush, Alexander M (May 2024, ICLR)

Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of 59 and token-level F1 of 78 and recovers 27% of prompts exactly. Code for reproducing all experiments is available at this http URL.
Full Text Available
Text Embeddings Reveal (Almost) As Much As Text

https://doi.org/10.18653/v1/2023.emnlp-main.765

Morris, John; Kuleshov, Volodymyr; Shmatikov, Vitaly; Rush, Alexander (January 2023, Association for Computational Linguistics)
Tree Prompting: Efficient Task Adaptation without Fine-Tuning

https://doi.org/10.18653/v1/2023.emnlp-main.384

Singh, Chandan; Morris, John; Rush, Alexander; Gao, Jianfeng; Deng, Yuntian (January 2023, Association for Computational Linguistics)

Full Text Available
White matter hyperintensity longitudinal morphometric analysis in association with Alzheimer disease

https://doi.org/10.1002/alz.13377

Strain, Jeremy Fuller; Phuah, Chia‐Ling; Adeyemo, Babatunde; Cheng, Kathleen; Womack, Kyle B.; McCarthy, John; Goyal, Manu; Chen, Yasheng; Sotiras, Aristeidis; An, Hongyu; et al (October 2023, Alzheimer's & Dementia)

INTRODUCTION Vascular damage in Alzheimer's disease (AD) has shown conflicting findings particularly when analyzing longitudinal data. We introduce white matter hyperintensity (WMH) longitudinal morphometric analysis (WLMA) that quantifies WMH expansion as the distance from lesion voxels to a region of interest boundary. METHODS WMH segmentation maps were derived from 270 longitudinal fluid‐attenuated inversion recovery (FLAIR) ADNI images. WLMA was performed on five data‐driven WMH patterns with distinct spatial distributions. Amyloid accumulation was evaluated with WMH expansion across the five WMH patterns. RESULTS The preclinical group had significantly greater expansion in the posterior ventricular WM compared to controls. Amyloid significantly associated with frontal WMH expansion primarily within AD individuals. WLMA outperformed WMH volume changes for classifying AD from controls primarily in periventricular and posterior WMH. DISCUSSION These data support the concept that localized WMH expansion continues to proliferate with amyloid accumulation throughout the entirety of the disease in distinct spatial locations.
Full Text Available
Discovery of target genes and pathways at GWAS loci by pooled single-cell CRISPR screens

https://doi.org/10.1126/science.adh7699

Morris, John A.; Caragine, Christina; Daniloski, Zharko; Domingo, Júlia; Barry, Timothy; Lu, Lu; Davis, Kyrie; Ziosi, Marcello; Glinos, Dafni A.; Hao, Stephanie; et al (May 2023, Science)

INTRODUCTION Genome-wide association studies (GWASs) have identified thousands of human genetic variants associated with diverse diseases and traits, and most of these variants map to noncoding loci with unknown target genes and function. Current approaches to understand which GWAS loci harbor causal variants and to map these noncoding regulators to target genes suffer from low throughput. With newer multiancestry GWASs from individuals of diverse ancestries, there is a pressing and growing need to scale experimental assays to connect GWAS variants with molecular mechanisms. Here, we combined biobank-scale GWASs, massively parallel CRISPR screens, and single-cell sequencing to discover target genes of noncoding variants for blood trait loci with systematic targeting and inhibition of noncoding GWAS loci with single-cell sequencing (STING-seq). RATIONALE Blood traits are highly polygenic, and GWASs have identified thousands of noncoding loci that map to candidate cis-regulatory elements (CREs). By combining CRE-silencing CRISPR perturbations and single-cell readouts, we targeted hundreds of GWAS loci in a single assay, revealing target genes in cis and in trans. For select CREs that regulate target genes, we performed direct variant insertion. Although silencing the CRE can identify the target gene, direct variant insertion can identify magnitude and direction of effect on gene expression for the GWAS variant. In select cases in which the target gene was a transcription factor or microRNA, we also investigated the gene-regulatory networks altered upon CRE perturbation and how these networks differ across blood cell types. RESULTS We inhibited candidate CREs from fine-mapped blood trait GWAS variants (from ~750,000 individuals of diverse ancestries) in human erythroid progenitors. In total, we targeted 543 variants (254 loci) mapping to candidate CREs, generating multimodal single-cell data including transcriptome, direct CRISPR gRNA capture, and cell surface proteins. We identified target genes in cis (within 500 kb) for 134 CREs. In most cases, we found that the target gene was the closest gene and that specific enhancer-associated biochemical hallmarks (H3K27ac and accessible chromatin) are essential for CRE function. Using multiple perturbations at the same locus, we were able to distinguish causal variants from noncausal variants in linkage disequilibrium. For a subset of validated CREs, we also inserted specific GWAS variants using base-editing STING-seq (beeSTING-seq) and quantified the effect size and direction of GWAS variants on gene expression. Given our transcriptome-wide data, we examined dosage effects in cis and trans in cases in which the cis target is a transcription factor or microRNA. We found that trans target genes are also enriched for GWAS loci, and identified gene clusters within trans gene networks with distinct biological functions and expression patterns in primary human blood cells. CONCLUSION In this work, we investigated noncoding GWAS variants at scale, identifying target genes in single cells. These methods can help to address the variant-to-function challenges that are a barrier for translation of GWAS findings (e.g., drug targets for diseases with a genetic basis) and greatly expand our ability to understand mechanisms underlying GWAS loci. Identifying causal variants and their target genes with STING-seq. Uncovering causal variants and their target genes or function is a major challenge for GWASs. STING-seq combines perturbation of noncoding loci with multimodal single-cell sequencing to profile hundreds of GWAS loci in parallel. This approach can identify target genes in cis and trans, measure dosage effects, and decipher gene-regulatory networks.
Full Text Available
The origin of blinking in both mudskippers and tetrapods is linked to life on land

https://doi.org/10.1073/pnas.2220404120

Aiello, Brett R.; Bhamla, M. Saad; Gau, Jeff; Morris, John G.; Bomar, Kenji; da Cunha, Shashwati; Fu, Harrison; Laws, Julia; Minoguchi, Hajime; Sripathi, Manognya; et al (May 2023, Proceedings of the National Academy of Sciences)

Blinking, the transient occlusion of the eye by one or more membranes, serves several functions including wetting, protecting, and cleaning the eye. This behavior is seen in nearly all living tetrapods and absent in other extant sarcopterygian lineages suggesting that it might have arisen during the water-to-land transition. Unfortunately, our understanding of the origin of blinking has been limited by a lack of known anatomical correlates of the behavior in the fossil record and a paucity of comparative functional studies. To understand how and why blinking originates, we leverage mudskippers (Oxudercinae), a clade of amphibious fishes that have convergently evolved blinking. Using microcomputed tomography and histology, we analyzed two mudskipper species, Periophthalmus barbarus and Periophthalmodon septemradiatus, and compared them to the fully aquatic round goby, Neogobius melanostomus. Study of gross anatomy and epithelial microstructure shows that mudskippers have not evolved novel musculature or glands to blink. Behavioral analyses show the blinks of mudskippers are functionally convergent with those of tetrapods: P. barbarus blinks more often under high-evaporation conditions to wet the eye, a blink reflex protects the eye from physical insult, and a single blink can fully clean the cornea of particulates. Thus, eye retraction in concert with a passive occlusal membrane can achieve functions associated with life on land. Osteological correlates of eye retraction are present in the earliest limbed vertebrates, suggesting blinking capability. In both mudskippers and tetrapods, therefore, the origin of this multifunctional innovation is likely explained by selection for increasingly terrestrial lifestyles.
Full Text Available
South-to-north migration preceded the advent of intensive farming in the Maya region

https://doi.org/10.1038/s41467-022-29158-y

Kennett, Douglas J.; Lipson, Mark; Prufer, Keith M.; Mora-Marín, David; George, Richard J.; Rohland, Nadin; Robinson, Mark; Trask, Willa R.; Edgar, Heather H.; Hill, Ethan C.; et al (December 2022, Nature Communications)

The genetic prehistory of human populations in Central America is largely unexplored, leaving an important gap in our knowledge of the global expansion of humans. We report genome-wide ancient DNA data for a transect of twenty individuals from two Belize rock-shelters dating between 9,600-3,700 calibrated radiocarbon years before present (cal. BP). The oldest individuals (9,600-7,300 cal. BP) descend from an Early Holocene Native American lineage with only distant relatedness to present-day Mesoamericans, including Mayan-speaking populations. After ~5,600 cal. BP a previously unknown human dispersal from the south made a major demographic impact on the region, contributing more than 50% of the ancestry of all later individuals. This new ancestry derived from a source related to present-day Chibchan speakers living from Costa Rica to Colombia. Its arrival corresponds to the first clear evidence for forest clearing and maize horticulture in what later became the Maya region.
Full Text Available
Covariance-based vs. correlation-based functional connectivity dissociates healthy aging from Alzheimer disease

https://doi.org/10.1016/j.neuroimage.2022.119511

Strain, Jeremy F; Brier, Matthew R; Tanenbaum, Aaron; Gordon, Brian A; McCarthy, John E; Dincer, Aylin; Marcus, Daniel S; Chhatwal, Jasmeer P; Graff-Radford, Neill R; Day, Gregory S; et al (November 2022, NeuroImage)

Full Text Available
